Progress Memo 2

Final Project
Data Science 2 with R (STAT 301-2)

Author

Cassie Lee

Published

February 26, 2024

Analysis Plan

Data Splitting

The 15,000 observations used for this prediction problem were split 75/25 between training and testing data, stratified by birth weight.

set          size
analysis     11249
assessment   3751

Total observations (n): 15000; columns (p): 25
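The split above can be reproduced with rsample; this is a sketch in which `birthweight_data`, `birth_weight`, and the seed are placeholder names/values, not the ones from the actual analysis.

```r
library(tidymodels)

set.seed(301)  # arbitrary seed for reproducibility

# 75/25 split, stratified on the outcome so both sets cover the full
# range of birth weights; `birthweight_data` and `birth_weight` stand in
# for the actual data frame and outcome column
bw_split <- initial_split(birthweight_data, prop = 0.75, strata = birth_weight)
bw_train <- training(bw_split)  # ~11,250 observations
bw_test  <- testing(bw_split)   # ~3,750 observations
```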

Resampling

The training data was then resampled using V-fold cross-validation with 4 folds and 3 repeats, so each model/workflow will be fit 12 times to produce a metric estimate and standard error. In each iteration, roughly 8,400 observations (three of the four folds) will be used to train the model and roughly 2,800 (the held-out fold) will be used to estimate its performance.
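The resampling scheme can be sketched with rsample as below; `bw_train` and `birth_weight` are assumed placeholder names for the training data and outcome.

```r
library(rsample)

set.seed(301)  # arbitrary seed for reproducibility

# 4 folds x 3 repeats = 12 resamples per model/workflow, stratified on
# the outcome as in the initial split
bw_folds <- vfold_cv(bw_train, v = 4, repeats = 3, strata = birth_weight)
```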

Recipes

The first distinct recipe is a basic recipe that keeps variables as is. The variables for interval since last pregnancy and interval since last birth are NA for plural deliveries, so I changed those NA values to 0. For individuals who did not receive any prenatal care, I changed the NA values to 10, a value outside the observed range that effectively flags the absence of prenatal care, because imputing these NA values did not make sense. I removed variables with exact linear dependencies between them and variables that did not have any variance. For the linear-model recipe, I dummy encoded all nominal predictors and centered and scaled all numeric predictors. For tree-based models, I used one-hot encoding to dummy the variables.
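A minimal sketch of this basic recipe is below; every column name (`interval_last_pregnancy`, `month_prenatal_began`, etc.) is a placeholder standing in for the actual variables in the natality data.

```r
library(tidymodels)

# Basic recipe: recode the structurally missing values, then drop
# redundant and zero-variance predictors
basic_rec <- recipe(birth_weight ~ ., data = bw_train) |>
  step_mutate(
    # NA intervals occur only for plural deliveries -> recode to 0
    interval_last_pregnancy = coalesce(interval_last_pregnancy, 0),
    interval_last_birth     = coalesce(interval_last_birth, 0),
    # 10 is outside the observed range, flagging "no prenatal care"
    month_prenatal_began    = coalesce(month_prenatal_began, 10)
  ) |>
  step_lincomb(all_numeric_predictors()) |>  # drop exact linear dependencies
  step_zv(all_predictors())                  # drop zero-variance predictors

# Linear-model variant: center/scale numeric predictors, then dummy encode
linear_rec <- basic_rec |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

# Tree-based variant: one-hot encoding, no scaling needed
tree_rec <- basic_rec |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE)
```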

The second distinct recipe operationalizes prenatal care as a binary indicator of whether care began in the first trimester, and creates a new variable identifying mothers in a higher-risk age group (teenage or over 35). For linear models, interaction terms were added between starting prenatal care early and number of prenatal visits, between weight gain and pre-pregnancy weight, between plural delivery and weight gain, between cigarettes and weight gain, and between the higher-risk age group indicator and number of prenatal visits. As with the first recipe, I removed variables with exact linear dependencies and predictors lacking any variance, and made the appropriate adjustments between the linear and tree-based versions of the recipe.
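The feature-engineered recipe might look like the sketch below; again, every column name is a hypothetical placeholder, and the interactions are included only in the linear-model version.

```r
library(tidymodels)

feature_rec <- recipe(birth_weight ~ ., data = bw_train) |>
  step_mutate(
    early_prenatal  = as.integer(month_prenatal_began <= 3),         # care began in 1st trimester
    higher_risk_age = as.integer(mother_age < 20 | mother_age > 35)  # teenage or over 35
  ) |>
  step_lincomb(all_numeric_predictors()) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  # interaction terms for the linear-model version of the recipe
  step_interact(~ early_prenatal:num_prenatal_visits +
                  weight_gain:prepregnancy_weight +
                  starts_with("plural_delivery"):weight_gain +
                  cigarettes:weight_gain +
                  higher_risk_age:num_prenatal_visits)
```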

Model Types

The model types I will train/fit are elastic net, random forest, boosted tree, and neural network. Using the basic recipes I have set up, I will create an initial tuned fit for each model type on the resampled data and, from this, identify the best model type.

Then, I will fit the best model with the same hyperparameters identified in the first step using the second distinct recipe type to see if it makes a difference, particularly with the inclusion of interaction terms. From this, I will decide which recipe to use.

Finally, I will fit the best model using the full training dataset.
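The tuning step for one model type could be sketched as below (elastic net shown; the other model types follow the same pattern with their own specs and grids). `linear_rec` and `bw_folds` are assumed placeholder names for the linear-model recipe and the resamples, and the grid size is illustrative.

```r
library(tidymodels)

# Tunable elastic net: penalty and mixture selected via grid search
enet_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

enet_wflow <- workflow() |>
  add_recipe(linear_rec) |>
  add_model(enet_spec)

set.seed(301)
enet_tuned <- tune_grid(
  enet_wflow,
  resamples = bw_folds,
  grid      = 25,               # number of candidate points in the search
  metrics   = metric_set(rmse)
)

# best hyperparameters, carried forward when trying the second recipe
# and for the final fit on the full training set
best_params <- select_best(enet_tuned, metric = "rmse")
```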

Metrics

The metric I will use to identify the best model is RMSE. I am using RMSE over MAE partly because it is the default metric in the tune package, but also because I would like larger errors to weigh more heavily in the performance estimate. Very low and very high birth weights are associated with birth complications, so highly inaccurate predictions would fail to identify risks of low or high birth weight, which have significant health impacts.
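A toy illustration of why RMSE weighs large errors more heavily than MAE, using yardstick's vector interfaces: two prediction sets with the same total absolute error, one spread evenly and one concentrated in a single large miss, get the same MAE but different RMSE.

```r
library(yardstick)

truth        <- c(3000, 3200, 3400, 3600)        # birth weights in grams
even_errors  <- truth + c(100, -100, 100, -100)  # four misses of 100 g
one_big_miss <- truth + c(400, 0, 0, 0)          # one miss of 400 g

mae_vec(truth, even_errors)    # 100
mae_vec(truth, one_big_miss)   # 100  (same mean absolute error)
rmse_vec(truth, even_errors)   # 100
rmse_vec(truth, one_big_miss)  # 200  (the large miss is penalized more)
```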

Null Model and Standard Linear Regression

I have fit a null model and a standard linear regression using the basic recipe. The metrics of these models are shown below:

model   .metric   mean       n    std_err
null    rmse      572.0364   12   1.196622
lm      rmse      527.7231   12   1.368002
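These baselines can be fit across the resamples with `fit_resamples()`; the sketch below assumes the placeholder recipe and resample names used earlier (`linear_rec`, `bw_folds`).

```r
library(tidymodels)

# Null model: predicts the mean birth weight for every observation
null_wflow <- workflow() |>
  add_recipe(linear_rec) |>
  add_model(null_model(mode = "regression") |> set_engine("parsnip"))

# Standard linear regression on the same recipe
lm_wflow <- workflow() |>
  add_recipe(linear_rec) |>
  add_model(linear_reg() |> set_engine("lm"))

null_res <- fit_resamples(null_wflow, resamples = bw_folds,
                          metrics = metric_set(rmse))
lm_res   <- fit_resamples(lm_wflow, resamples = bw_folds,
                          metrics = metric_set(rmse))

# mean RMSE and standard error across the 12 resamples
collect_metrics(null_res)
collect_metrics(lm_res)
```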

Exploratory Data Analysis

Since I had so many observations available (over 3 million), I used 15,000 observations for a quick exploratory data analysis. I plotted each predictor against birth weight to see if there was any variation across the categories. From this analysis, the strongest predictors of birth weight appear to be whether the birth was a plural delivery (Figure 1), the number of cigarettes smoked daily before pregnancy (Figure 2), the mother’s height (Figure 3), the mother’s pre-pregnancy weight (Figure 4), the number of prenatal visits (Figure 5), and the mother’s weight gain during pregnancy (Figure 6). Other variables, such as the number of prior children still living, also showed relationships with birth weight, but with smaller effect sizes.

In my preprocessing steps, I had originally used step_nzv() to remove predictors with near-zero variance, but realized that this would remove variables such as whether the birth was a plural delivery. Since these variables appear to have a significant impact on birth weight, I switched to step_zv() so that predictors are removed only if they truly have no variance.

Figure 1: Plural delivery
Figure 2: Daily cigarettes before pregnancy
Figure 3: Mother’s height
Figure 4: Mother’s weight before pregnancy
Figure 5: Number of prenatal visits
Figure 6: Weight gain during pregnancy

Progress Summary and Potential Issues

I have started the tuning process for the elastic net and boosted tree models. The boosted tree model took a little over two hours to fit, so I plan to increase the number of candidates in the random grid search for tuning.

The biggest issue right now is that I cannot fit/train tree-based models and charge my computer at the same time without it overheating. Because of this, I have reduced the number of observations and variables from my original analysis plan, as well as the number of folds and repeats in resampling. With these reductions, a model fits in several hours, and my computer’s battery lasts long enough to finish.